Choosing Bucket Boundaries for Histograms
Authors
Abstract
Histograms have long been used to capture attribute value distribution statistics for query optimizers. More recently, there has been a growing interest in the use of histograms to produce quick approximate answers to decision support queries. This motivates finding good strategies for specifying histogram buckets. Under the assumption that finding optimal bucket boundaries is computationally inefficient, previous research has focused on finding heuristics that produce good solutions. In this paper, we present an algorithm to determine bucket boundaries optimally, in time proportional to the square of the number of distinct data values, for a broad class of optimality metrics. Through experimentation, we show that optimal histograms can have substantially lower reconstruction error than histograms produced according to popular heuristics. We also present a new heuristic, based on our understanding of the optimal solution, which in many cases obtains lower reconstruction error than previously proposed heuristics, with a computation cost that is still quite low.
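To make the idea of optimal bucket boundaries concrete, the following is a minimal sketch of a textbook dynamic program, not the paper's own algorithm, which the abstract does not spell out. It assumes one particular optimality metric (the sum of squared deviations of each frequency from its bucket mean) and uses hypothetical names (`optimal_buckets`, `freqs`, `num_buckets`). For each bucket count it does work quadratic in the number of distinct values.

```python
from typing import List, Tuple


def optimal_buckets(freqs: List[float], num_buckets: int) -> Tuple[float, List[int]]:
    """Return (minimum total squared error, start index of each bucket).

    freqs[i] is the frequency of the i-th distinct value in sorted order.
    Assumed metric: sum of squared deviations from each bucket's mean.
    """
    n = len(freqs)
    if not 1 <= num_buckets <= n:
        raise ValueError("num_buckets must be between 1 and len(freqs)")

    # Prefix sums of f and f^2 give the error of any bucket in O(1).
    pre = [0.0] * (n + 1)
    pre_sq = [0.0] * (n + 1)
    for i, f in enumerate(freqs):
        pre[i + 1] = pre[i] + f
        pre_sq[i + 1] = pre_sq[i] + f * f

    def bucket_error(i: int, j: int) -> float:
        # Squared error of one bucket covering freqs[i:j], with j > i.
        s = pre[j] - pre[i]
        return (pre_sq[j] - pre_sq[i]) - s * s / (j - i)

    inf = float("inf")
    # best[b][j]: minimum error covering the first j values with b buckets.
    best = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    # cut[b][j]: start index of the last bucket in that optimal solution.
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    best[0][0] = 0.0

    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                cost = best[b - 1][i] + bucket_error(i, j)
                if cost < best[b][j]:
                    best[b][j] = cost
                    cut[b][j] = i

    # Walk the cut table backwards to recover the bucket start indices.
    starts: List[int] = []
    j = n
    for b in range(num_buckets, 0, -1):
        i = cut[b][j]
        starts.append(i)
        j = i
    starts.reverse()
    return best[num_buckets][n], starts
```

For example, `optimal_buckets([1, 1, 1, 10, 10, 10], 2)` places the second bucket's start at index 3, separating the low frequencies from the high ones. Other error metrics could be substituted by changing `bucket_error`, provided the total error remains a sum over buckets.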
Similar resources
Optimal Histograms with Quality Guarantees
Histograms are commonly used to capture attribute value distribution statistics for query optimizers. More recently, histograms have also been considered as a way to produce quick approximate answers to decision support queries. This widespread interest in histograms motivates the problem of computing histograms that are good under a given error metric. In particular, we are interested in an e...
Piecewise Linear Histograms for Selectivity Estimation
Selectivity estimation of queries is of critical importance to query optimization. In order to get accurate estimations, database management systems must maintain statistics to capture the underlying data distribution. Histograms are extensively used in commercial database systems for this purpose. Most current histogram techniques make the assumption that all values in a single bucket appear w...
Histogram refinement for content-based image retrieval
Color histograms are widely used for content-based image retrieval. Their advantages are efficiency, and insensitivity to small changes in camera viewpoint. However, a histogram is a coarse characterization of an image, and so images with very different appearances can have similar histograms. We describe a technique for comparing images called histogram refinement, which imposes additional con...
Histogram Refinement for Content-Based Image Retrieval, Greg Pass
Color histograms are widely used for content-based image retrieval. Their advantages are efficiency, and insensitivity to small changes in camera viewpoint. However, a histogram is a coarse characterization of an image, and so images with very different appearances can have similar histograms. We describe a technique for comparing images called histogram refinement, which imposes additional constra...
A nearly optimal and deterministic summary structure for update data streams
We present a deterministic summary structure over update streams that enables deterministic and the first space-optimal algorithms for a variety of problems, including estimating frequencies, finding approximate frequent items, finding approximate quantiles, finding hierarchical heavy hitters, approximately optimal B-bucket histograms, estimating inner product sizes, etc.
Publication date: 2007